{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Build a classification decision tree\n",
    "\n",
    "In this notebook we illustrate decision trees in a multiclass classification\n",
    "problem by using the penguins dataset with 2 features and 3 classes.\n",
    "\n",
    "For the sake of simplicity, we focus the discussion on the hyperparameter\n",
    "`max_depth`, which controls the maximal depth of the decision tree."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition note alert alert-info\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Note</p>\n",
    "<p class=\"last\">If you want a deeper overview regarding this dataset, you can refer to the\n",
    "Appendix - Datasets description section at the end of this MOOC.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "penguins = pd.read_csv(\"../datasets/penguins_classification.csv\")\n",
    "culmen_columns = [\"Culmen Length (mm)\", \"Culmen Depth (mm)\"]\n",
    "target_column = \"Species\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "First, we split the data into two subsets to investigate how trees predict\n",
    "values based on unseen data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "data, target = penguins[culmen_columns], penguins[target_column]\n",
    "data_train, data_test, target_train, target_test = train_test_split(\n",
    "    data, target, random_state=0\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In a previous notebook, we learnt that linear classifiers define a linear\n",
    "separation to split classes using a linear combination of the input features.\n",
    "In our 2-dimensional feature space, it means that a linear classifier finds\n",
    "the oblique lines that best separate the classes. This is still true for\n",
    "multiclass problems, except that more than one line is fitted. We can use\n",
    "`DecisionBoundaryDisplay` to plot the decision boundaries learnt by the\n",
    "classifier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "linear_model = LogisticRegression()\n",
    "linear_model.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import matplotlib as mpl\n",
    "import seaborn as sns\n",
    "\n",
    "from sklearn.inspection import DecisionBoundaryDisplay\n",
    "\n",
    "tab10_norm = mpl.colors.Normalize(vmin=-0.5, vmax=8.5)\n",
    "# create a palette to be used in the scatterplot\n",
    "palette = [\"tab:blue\", \"tab:green\", \"tab:orange\"]\n",
    "\n",
    "dbd = DecisionBoundaryDisplay.from_estimator(\n",
    "    linear_model,\n",
    "    data_train,\n",
    "    response_method=\"predict\",\n",
    "    cmap=\"tab10\",\n",
    "    norm=tab10_norm,\n",
    "    alpha=0.5,\n",
    ")\n",
    "sns.scatterplot(\n",
    "    data=penguins,\n",
    "    x=culmen_columns[0],\n",
    "    y=culmen_columns[1],\n",
    "    hue=target_column,\n",
    "    palette=palette,\n",
    ")\n",
    "# put the legend outside the plot\n",
    "plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n",
    "_ = plt.title(\"Decision boundary using a logistic regression\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the lines are a combination of the input features since they are\n",
    "not perpendicular a specific axis. Indeed, this is due to the model\n",
    "parametrization that we saw in some previous notebooks, i.e. controlled by the\n",
    "model's weights and intercept.\n",
    "\n",
    "Besides, it seems that the linear model would be a good candidate for such\n",
    "problem as it gives good accuracy."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "linear_model.fit(data_train, target_train)\n",
    "test_score = linear_model.score(data_test, target_test)\n",
    "print(f\"Accuracy of the LogisticRegression: {test_score:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Unlike linear models, the decision rule for the decision tree is not\n",
    "controlled by a simple linear combination of weights and feature values.\n",
    "\n",
    "Instead, the decision rules of trees can be defined in terms of\n",
    "- the feature index used at each split node of the tree,\n",
    "- the threshold value used at each split node,\n",
    "- the value to predict at each leaf node.\n",
    "\n",
    "Decision trees partition the feature space by considering a single feature at\n",
    "a time. The number of splits depends on both the hyperparameters and the\n",
    "number of data points in the training set: the more flexible the\n",
    "hyperparameters and the larger the training set, the more splits can be\n",
    "considered by the model.\n",
    "\n",
    "As the number of adjustable components taking part in the decision rule\n",
    "changes with the training size, we say that decision trees are non-parametric\n",
    "models.\n",
    "\n",
    "Let's now visualize the shape of the decision boundary of a decision tree when\n",
    "we set the `max_depth` hyperparameter to only allow for a single split to\n",
    "partition the feature space."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "\n",
    "tree = DecisionTreeClassifier(max_depth=1)\n",
    "tree.fit(data_train, target_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DecisionBoundaryDisplay.from_estimator(\n",
    "    tree,\n",
    "    data_train,\n",
    "    response_method=\"predict\",\n",
    "    cmap=\"tab10\",\n",
    "    norm=tab10_norm,\n",
    "    alpha=0.5,\n",
    ")\n",
    "sns.scatterplot(\n",
    "    data=penguins,\n",
    "    x=culmen_columns[0],\n",
    "    y=culmen_columns[1],\n",
    "    hue=target_column,\n",
    "    palette=palette,\n",
    ")\n",
    "plt.legend(bbox_to_anchor=(1.05, 1), loc=\"upper left\")\n",
    "_ = plt.title(\"Decision boundary using a decision tree\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The partitions found by the algorithm separates the data along the axis\n",
    "\"Culmen Depth\", discarding the feature \"Culmen Length\". Thus, it highlights\n",
    "that a decision tree does not use a combination of features when making a\n",
    "single split. We can look more in depth at the tree structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import plot_tree\n",
    "\n",
    "_, ax = plt.subplots(figsize=(8, 6))\n",
    "_ = plot_tree(\n",
    "    tree,\n",
    "    feature_names=culmen_columns,\n",
    "    class_names=tree.classes_.tolist(),\n",
    "    impurity=False,\n",
    "    ax=ax,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"admonition tip alert alert-warning\">\n",
    "<p class=\"first admonition-title\" style=\"font-weight: bold;\">Tip</p>\n",
    "<p class=\"last\">We are using the function <tt class=\"docutils literal\">fig, ax = <span class=\"pre\">plt.subplots(figsize=(8,</span> 6))</tt> to create\n",
    "a figure and an axis with a specific size. Then, we can pass the axis to the\n",
    "<tt class=\"docutils literal\">sklearn.tree.plot_tree</tt> function such that the drawing happens in this axis.</p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We see that the split was done on the culmen depth feature. The original\n",
    "dataset was subdivided into 2 sets based on the culmen depth (inferior or\n",
    "superior to 16.45 mm).\n",
    "\n",
    "This partition of the dataset minimizes the class diversity in each\n",
    "sub-partitions. This measure is also known as a **criterion**, and is a\n",
    "settable parameter.\n",
    "\n",
    "If we look more closely at the partition, we see that the sample superior to\n",
    "16.45 belongs mainly to the \"Adelie\" class. Looking at the values, we indeed\n",
    "observe 103 \"Adelie\" individuals in this space. We also count 52 \"Chinstrap\"\n",
    "samples and 6 \"Gentoo\" samples. We can make similar interpretation for the\n",
    "partition defined by a threshold inferior to 16.45mm. In this case, the most\n",
    "represented class is the \"Gentoo\" species.\n",
    "\n",
    "Let's see how our tree would work as a predictor. Let's start with a case\n",
    "where the culmen depth is inferior to the threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_penguin_1 = pd.DataFrame(\n",
    "    {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [15]}\n",
    ")\n",
    "tree.predict(test_penguin_1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The class predicted is the \"Gentoo\". We can now check what happens if we pass a\n",
    "culmen depth superior to the threshold."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_penguin_2 = pd.DataFrame(\n",
    "    {\"Culmen Length (mm)\": [0], \"Culmen Depth (mm)\": [17]}\n",
    ")\n",
    "tree.predict(test_penguin_2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this case, the tree predicts the \"Adelie\" specie.\n",
    "\n",
    "Thus, we can conclude that a decision tree classifier predicts the most\n",
    "represented class within a partition.\n",
    "\n",
    "During the training, we have a count of samples in each partition, we can also\n",
    "compute the probability of belonging to a specific class within this\n",
    "partition."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_pred_proba = tree.predict_proba(test_penguin_2)\n",
    "y_proba_class_0 = pd.Series(y_pred_proba[0], index=tree.classes_)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_proba_class_0.plot.bar()\n",
    "plt.ylabel(\"Probability\")\n",
    "_ = plt.title(\"Probability to belong to a penguin class\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can also compute the different probabilities manually directly from the\n",
    "tree structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "adelie_proba = 103 / 161\n",
    "chinstrap_proba = 52 / 161\n",
    "gentoo_proba = 6 / 161\n",
    "print(\n",
    "    \"Probabilities for the different classes:\\n\"\n",
    "    f\"Adelie: {adelie_proba:.3f}\\n\"\n",
    "    f\"Chinstrap: {chinstrap_proba:.3f}\\n\"\n",
    "    f\"Gentoo: {gentoo_proba:.3f}\\n\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It is also important to note that the culmen length has been disregarded for\n",
    "the moment. It means that regardless of its value, it is not used during the\n",
    "prediction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_penguin_3 = pd.DataFrame(\n",
    "    {\"Culmen Length (mm)\": [10_000], \"Culmen Depth (mm)\": [17]}\n",
    ")\n",
    "tree.predict_proba(test_penguin_3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Going back to our classification problem, the split found with a maximum depth\n",
    "of 1 is not powerful enough to separate the three species and the model\n",
    "accuracy is low when compared to the linear model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "tree.fit(data_train, target_train)\n",
    "test_score = tree.score(data_test, target_test)\n",
    "print(f\"Accuracy of the DecisionTreeClassifier: {test_score:.2f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, it is not a surprise. We saw earlier that a single feature is not able\n",
    "to separate all three species: it underfits. However, from the previous\n",
    "analysis we saw that by using both features we should be able to get fairly\n",
    "good results.\n",
    "\n",
    "In the next exercise, you will increase the tree depth to get an intuition on\n",
    "how such a parameter affects the space partitioning."
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "name": "python3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}